Genomic diversity of Helicobacter pylori populations from different regions of the human stomach

ABSTRACT Individuals infected with Helicobacter pylori harbor unique and diverse populations of quasispecies, but diversity between and within different regions of the human stomach and the process of bacterial adaptation to each location are not yet well understood. We applied whole-genome deep sequencing to characterize the within- and between-stomach region genetic diversity of H. pylori populations from paired antrum and corpus biopsies of 15 patients, along with single biopsies from one region of an additional 3 patients, by scanning allelic diversity. We combined population deep sequencing with more conventional sequencing of multiple H. pylori single colony isolates from individual biopsies to generate a unique dataset. Single colony isolates were used to validate the scanning allelic diversity pipelines. We detected extensive population allelic diversity within the different regions of each patient’s stomach. Diversity was most commonly found within non-coding, hypothetical, outer membrane, restriction modification system, virulence, lipopolysaccharide biosynthesis, efflux systems, and chemotaxis-associated genes. Antrum and corpus populations from the same patient grouped together phylogenetically, indicating that most patients were initially infected with a single strain, which then diversified. Single colonies from the antrum and corpus of the same patients grouped into distinct clades, suggesting mechanisms for within-location adaptation across multiple H. pylori isolates from different patients. The comparisons made available by combined sequencing and analysis of isolates and populations enabled comprehensive analysis of the genetic changes associated with H. pylori diversification and stomach region adaptation.


Introduction
Helicobacter pylori typically first colonizes people in early childhood and then persists as a chronic, lifelong infection. 1 This usually results in gastritis, which is most often asymptomatic. 2 The chronic inflammatory nature of the infection, and the high mutation and recombination rate of this bacterium, are thought to contribute to a diverse bacterial population (quasispecies) within the infected gastric mucosa. Colonizing bacteria in the antrum and corpus regions of the human stomach will be exposed to different levels of acidity, inflammatory factors, and access to sheltering glands and mucus. This variation in environmental conditions within stomachs may drive bacterial diversification over time. Particularly in high-prevalence areas, humans are colonized by multiple strains, 3 further increasing this diversity. The high level of genetic diversity of H. pylori has implications for the design of successful eradication regimens and vaccines. However, the extent and characteristics of H. pylori diversity within infected individuals are not yet fully understood.
Genomic approaches have highlighted the intimate relationship of H. pylori infection and coevolution with humans which has occurred at least since anatomically modern humans migrated out of Africa approximately 58,000 years ago. [4][5][6] Genomic analyses of H. pylori isolates across the world have revealed a global population structure and diversity. [7][8][9] Comparative genomics approaches have been taken, usually with single colony isolates from each patient, to investigate global population diversity in H. pylori. Other studies have characterized H. pylori strains from specific geographical regions. [10][11][12][13] These global and geographical genetic analyses have facilitated the identification of genotypes associated with different disease types and severity. For example, a recent genome-wide association study 14 (GWAS) identified SNPs and genes in H. pylori genomes that could be used to assess gastric cancer risk in infected people. A similar GWAS approach was undertaken to identify regions of the genome associated with the progression of duodenal ulceration to gastric cancer from East Asian (hspEAsia) H. pylori strains. 15 Another study identified six genes that were associated with peptic ulcer disease and gastric adenocarcinoma, by comparing multiethnic populations at high to low risk of these diseases. 16 Therefore, despite there being a very high level of diversity in H. pylori genomes, analysis at global and local geographical levels has provided important insights into bacterial virulence and disease progression. In addition, comparative genomic analysis of multiple single colony isolates from different regions of individual patient stomachs has informed our understanding of H. pylori niche adaptation and intragastric migration at the individual host level. 17 Ailloud et al (2019) 17 performed phylogenetic analyses on multiple H. pylori isolates from different stomach regions of 16 adult patients, and their findings suggested that pressures for genetic adaptation were different according to characteristics of the gastric niche, although there appeared to be migration of bacteria between regions of the stomach.
Studies of bacterial populations that rely on sequencing single colony isolates can only capture a subset of the genetic diversity, but bacterial diversity at the population level can be more comprehensively assessed using deep sequencing methodologies. Within-patient genetic diversity of Burkholderia dolosa from chronically infected individuals was previously investigated using a population deep sequencing approach, 18 where the reads mapped onto a reference genome identified the most variable genes and regions, and captured a snapshot of within-patient genetic diversity of the whole population. For this study, we adopted a similar approach 18 and adapted it to H. pylori populations from gastric biopsies taken from the antrum and corpus of infected patients. Single colonies were also isolated from a subset of these biopsies for conventional whole-genome sequencing. Here, we present a high-resolution analysis of H. pylori genetic structure and population diversity using these techniques in H. pylori populations for the first time.

Results
To characterize H. pylori population diversity, a combination of deep sequencing of isolates from the antrum and corpus of human stomachs, and comparative analysis of consensus assembled genomes and multiple single colony isolates from the same populations, was applied. Gastric biopsies were obtained from the antrum and corpus of 18 infected individuals (median age 63.5 years, range 40-79 years, seven males and 11 females). H. pylori was cultured from the biopsies as confluent population sweeps, and this growth was streaked out for isolation of single colonies. Paired gastric isolates were recovered from the antrum and corpus of 15 patients, whereas isolates were only recovered from one site for the remaining three patients. Paired biopsies with markedly different histological inflammation and damage severity (Sydney scores), and/or differing H. pylori vacA, cagA and cagE virulence factor PCR genotypes, between antrum and corpus, were prioritized for inclusion in this study, which aimed to utilize a combined approach of deep population and single colony isolate sequencing. The H. pylori population growth and 119 single colonies were sequenced using the Illumina MiSeq platform. Annotation was guided by the reference strain H. pylori 26695. The total data set consisted of deep sequenced populations from 33 H. pylori population sweeps (15 paired antral and corporal populations, and three single location populations) and 119 genomes from single colony isolates extracted from the populations of 11 different patients.

Genome alignments revealed larger scale differences between antrum and corpus populations
Noticeable gaps or stretches with <95% BLASTN identity were identified between the paired antrum and corpus aligned population consensus assemblies from 12/15 patients (Fig. 1A-B; Suppl. Fig. 1-14 A-B). There were more nonsynonymous mutations between paired antrum and corpus populations, suggesting that selection pressures were acting independently within and between these regions (Suppl. Fig. 17).

Figure 1. Representative whole-genome alignment for H. pylori strains isolated from patient 322. Whole-genome alignments of the consensus genomes generated by deep population sequencing of antrum and corpus H. pylori populations, using antrum (a) or corpus (b) consensus genome as the reference. Panel c depicts the alignment of colony isolate assembled genomes against the patient reference (assembly created by combining the curated reads from both antrum and corpus regions). Colour intensity of each ring indicates percentage identity between antrum and corpus consensus genomes. Positions of contigs within the assembled reference genome are shown as a ring alternating in colour between yellow and black in panels a-b to show contig boundaries. Blue bars in the coverage ring show regions >700X. Alignments for the other patients are shown in Suppl. Fig. 1-16.
For the analysis of isolates from multiple regions of the same stomach, a patient-specific combined antrum and corpus consensus assembly was de novo assembled using all deep population sequencing reads from the antrum and corpus datasets and used as the reference sequence ( Fig. 1C; Suppl. Figure 5C, 7C, 9C, 10C, 11C, 12C, 13C, 14C). This was done to reduce the reference bias. For the two patients where paired antrum and corpus deep sequencing reads were unavailable as only one region was deep sequenced, the population consensus assembly was used as the reference to compare the single colony isolates (Suppl. Fig. 15-16).
Single colony isolate genomes from the antrum and/or corpus of 11 patients were aligned and visualized using high BLASTN identity between 96% and 100% ( Fig. 1C; Suppl. Figure 5C, 7C, 9C, 10C, 11C, 12C, 13C, 14C, 15,16) to identify small scale differences between the aligned genomes. Defined stretches of BLASTN identity were evident between single colony isolates from the antrum and corpus of each patient. This "visual fingerprint" could differentiate the isolates taken from the antrum and corpus, suggesting stomach region-specific adaptation and between-region differences.
While Figure 1 and supplementary figures 5C, 7C, 9C, 10C, 11C, 12C, 13C, 14C, 15 and 16 identified differences in aligned single colony genomes from paired antrum and corpus regions, the pangenome analysis revealed a "fingerprint" of accessory genes associated with each patient (Suppl. Fig. 18). There were only very minor differences in accessory genes between single colony isolates from the same patient, including between isolates from the antrum and corpus regions. This suggests that most patients in our dataset were infected by a single strain and that diversity between strains was likely due to mutations accumulating over the time of infection. The exception to this was patient 565, in whom 4/6 single colony isolates showed a different accessory genome fingerprint in comparison to the other strains (2/6 from the corpus and 6/6 from the antrum region), which had the same accessory gene fingerprint. This suggests that patient 565 likely had a multi-strain infection.

Deep sequencing and read mapping identified the most variable genes within H. pylori populations
A read mapping and polymorphism detection pipeline was developed (Table 1) to initially investigate within-region diversity using stringent thresholds to identify alleles with very high confidence, including elevated allele fractions, which were defined as "common" alleles within datasets (Figure 2; Suppl. Fig. 19). Stringency thresholds were then lowered for some parameters (particularly the number of alternative reads mapping in the forward and reverse direction) to detect "minor" allele diversity (Table 1; Figure 2; Suppl. Fig. 20). "Minor" alleles were present at lower frequencies in the population, while "common" alleles were present at higher frequencies and represent polymorphisms that may be approaching fixation in the population or represent genes/bases under more recent selective pressures. 18 A total of 7,920 common allelic variants were detected within 1,011 genes from 33 deep-sequenced samples. Excluding hypothetical proteins, the highest common allelic variation ( Table 1) with polymorphisms in these genes detected in H. pylori populations isolated from 22/33 patient biopsies. Polymorphisms in virulence-related genes including the vacA paralogue HP0289 (n = 5), babA (n = 5), cagA (n = 4) and sabA (n = 4) were also identified in the common allele dataset. Other notable genes with high allelic diversity within the common allele dataset were those encoding the glutathione-regulated potassium-efflux system protein HP0471 (n = 5), DNA-directed RNA polymerase subunit beta/beta′ (HP1198; n = 5) and acetyl-CoA acetyltransferase (HP0690; n = 5).
The length of H. pylori colonization time (years) was estimated for each patient and stomach location using the deep population sequencing data and the mutation rate determined by Kennemann et al (2011), 26 and is denoted in Suppl. Table 3.
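As a rough illustration of how a diversification time can be derived from a count of polymorphic sites and a per-site mutation rate, the sketch below divides the number of polymorphic sites by the expected number of mutations per genome per year. The function name, the genome length and the rate used in the example call are illustrative assumptions only; they are not the values or necessarily the exact formula used in this study, and the rate should be taken from Kennemann et al (2011). 26

```python
def estimate_years(n_polymorphic_sites, genome_length_bp, rate_per_site_per_year):
    """Years of diversification implied by a count of polymorphic sites.

    Hypothetical helper: assumes mutations accumulate at a constant
    per-site rate and ignores selection and recombination.
    """
    if genome_length_bp <= 0 or rate_per_site_per_year <= 0:
        raise ValueError("genome length and mutation rate must be positive")
    expected_mutations_per_year = genome_length_bp * rate_per_site_per_year
    return n_polymorphic_sites / expected_mutations_per_year


# Illustrative inputs only (not values from this study).
years = estimate_years(
    n_polymorphic_sites=100,
    genome_length_bp=1_600_000,
    rate_per_site_per_year=2.5e-5,
)
print(f"Estimated diversification time: {years:.1f} years")
```

Because the estimate scales linearly with the number of polymorphic sites, any filtering step that removes true variants shortens the estimate, and any process that adds variants in bulk (such as recombination between co-infecting strains) lengthens it.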

Combining population deep sequencing, read mapping and genome alignment allows more powerful analysis of H. pylori population structure and diversity
Heat maps were produced (Figure 3; Suppl. Fig. 21) and these highlighted the number of differences detected by each approach. Some genes, for example cagY (HP0527) in patient 265, were identified as having differences between the antrum and corpus regions by genome alignment but also had polymorphic diversity within one or both region(s) detected by the minor allele calling pipeline. Other genes were different between the regions (by consensus genome alignment) but showed no within-niche diversity at either region, e.g. dnaE (HP1460) in patient 120. This shows how both techniques can be used in tandem to better capture the genetic diversity both within and between populations of H. pylori.

Table 1. Description of the steps and software used in the common and minor allele calling pipelines. We adopted the workflow from Lieberman et al (2014) 18 with the following format: remove adapter read-through (Trimmomatic 20 ), trim low-quality bases from reads (Sickle 21 ), align reads (Bowtie2 22 ), call potential variants (FreeBayes 23 and vcflib 24 ). For the minor allele calling pipeline we adopted the thresholds described by Lieberman et al (2014) 18 and adapted them for the "common allele" calling pipeline to identify alleles that were of high confidence (additional manual trimming of SNPs and SNPs supported on both forward and reverse reads) and consisted of elevated allele fractions (15 or more alternative allele calls from reads aligning in both the forward and reverse direction).

Common allele calling pipeline
Sequencing reads were trimmed for Illumina sequencing adapters and adapter read-through using Trimmomatic. 20 Reads were further trimmed with Sickle 21 to Phred 30 with a minimum read length of 50 bp.
Bowtie2 22 (version 2.3.4.3) was used to map curated paired-end reads to the consensus assembled genome in "very sensitive" mode with a maximum fragment length of 2000 bp, and the number of ambiguous characters allowed between an aligned read was reduced to ~1%.
SAMtools suite 25 (version 1.9) was used to sort sequence alignment and remove PCR duplicates.
FreeBayes 23 (version 1.3.1) was used with a mapping quality of Phred 34 and base quality of Phred 30.
Alternative alleles were called only when ≥15 supporting reads were observed in the forward and reverse direction through vcflib 24 (version 1.0.0). Insertions and deletions were also filtered out. SNPs located in the first or last 500 bp of an assembled contig were removed manually. SNPs were manually filtered out if they were identified within a repeat region of ≥6 nucleotides where the alternative allele was >3 bases from the beginning of an individual read.

Minor allele calling pipeline
Sequencing reads were trimmed for Illumina sequencing adapters and adapter read-through using Trimmomatic. 20 Reads were further trimmed with Sickle 21 to Phred 30 with a minimum read length of 50 bp.
Bowtie2 22 was used to map curated paired-end reads to the consensus assembled genome in "very sensitive" mode with a maximum fragment length of 2000 bp, and the number of ambiguous characters allowed between an aligned read was reduced to ~1%.
SAMtools 25 suite was used to sort sequence alignment and remove PCR duplicates.
FreeBayes 23 was used with a mapping quality of Phred 34, base quality of Phred 30 and the minimum alternative fraction of reads to support an alternative allele was set to 3%.
Insertions and deletions were also filtered out through vcflib. 24 SNPs located in the first or last 500 bp of an assembled contig were removed manually.

Figure 2. This is a filtered version of Suppl. Fig. 19-20 to highlight the most diverse genes within our dataset. This heat map shows the common and minor allelic variation across genes (indicated as "HP" gene IDs in line with the nomenclature) for each antrum- and corpus-derived H. pylori population, derived from deep sequencing of the population from each biopsy and read-mapping back to the consensus genome to identify variant bases at each locus. Higher color intensity indicates a larger number of variant bases within that gene. Only variable genes represented by two or more patient samples were selected in the common allele variation dataset. Due to the high number of allelic genes in the minor allele variation, genes were filtered to highlight minor allelic variant genes shared between six or more different patient samples. In both allele calling datasets, sample 565C was excluded due to the extreme variation observed, likely due to this sample harboring a mixed strain infection. For an unfiltered heatmap, including sample 565C, see Suppl. Fig. 19-20. A list of the gene product descriptions for each gene ID (HP number) can be found in Suppl. Table 1. A list of the minor allelic variants can be found in Suppl. Table 2.

Figure 3. Each column contains data from one patient. Within each column, three datasets are shown. From left to right these are: between stomach region variation from whole-genome alignment antrum versus corpus; within antrum minor allele variation; within corpus minor allele variation. Darker color intensity indicates a larger number of variant bases within that gene/gene product. Gene products/genes that were shown to be diverse six or more times by any methodology across the patient datasets were included in this figure to reduce figure size and to highlight genes/gene
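The automated filters described in Table 1 for the common allele calling pipeline can be sketched as a simple predicate over variant records. This is a minimal sketch under stated assumptions, not the pipeline used in the study: the record layout (a dictionary with `ref`, `alt`, `pos`, `contig_length`, `saf` and `sar` keys) and the function name are hypothetical, with `saf` and `sar` standing for the numbers of alternative-allele reads on the forward and reverse strands. The manual repeat-region filter and the 3% minimum alternative fraction of the minor allele pipeline are not modelled.

```python
def passes_common_allele_filter(variant, min_strand_reads=15, edge_bp=500):
    """Return True if a variant record survives the illustrative filters."""
    ref, alt = variant["ref"], variant["alt"]
    # Insertions and deletions change the allele length and are discarded.
    if len(ref) != len(alt):
        return False
    # Require support from at least `min_strand_reads` reads on each strand.
    if variant["saf"] < min_strand_reads or variant["sar"] < min_strand_reads:
        return False
    # Discard calls in the first or last `edge_bp` bases of the contig
    # (positions are assumed to be 1-based).
    pos = variant["pos"]
    if pos <= edge_bp or pos > variant["contig_length"] - edge_bp:
        return False
    return True


# Made-up example records, for illustration only.
calls = [
    {"pos": 1200, "ref": "A", "alt": "G", "saf": 20, "sar": 18, "contig_length": 50000},
    {"pos": 1300, "ref": "A", "alt": "AT", "saf": 40, "sar": 35, "contig_length": 50000},
    {"pos": 2000, "ref": "C", "alt": "T", "saf": 20, "sar": 3, "contig_length": 50000},
    {"pos": 120, "ref": "G", "alt": "A", "saf": 30, "sar": 30, "contig_length": 50000},
]
kept = [v for v in calls if passes_common_allele_filter(v)]
print([v["pos"] for v in kept])
```

Lowering `min_strand_reads` in such a predicate corresponds to the relaxation of stringency that separates the minor allele pipeline from the common allele pipeline.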

Single colony sequencing added value by allowing phylogenetic and pan-genome analyses
Phylogenetic trees were constructed for the single colony isolates from patients with isolates from paired antrum and corpus regions (Figure 4). While the antrum and corpus isolates from some patients separated out into different clades, this was not always the case (e.g. Figure 4b, 4f). There was evidence of migration of H. pylori between antrum and corpus in 5/9 patients, with one or more isolate(s) clustering in the opposite region's isolate cluster.
Pan-genome analysis confirmed that the isolates from each patient clustered together (Suppl. Fig. 18), indicating that most patients had originally been infected with a single strain that subsequently diversified. The exception was patient 565 in whom two distinct H. pylori clades were evident suggesting a mixed strain infection.
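The idea of an accessory gene "fingerprint" can be illustrated with gene presence/absence sets. The sketch below is an assumption-laden toy example rather than the pan-genome software used in the study: the isolate names, gene names and both helper functions are hypothetical. Genes present in every isolate are treated as core, the remainder as accessory, and the Jaccard distance gives a simple measure of how different two gene repertoires are.

```python
def accessory_genes(gene_sets):
    """Genes present in at least one, but not all, of the isolates."""
    sets = [set(genes) for genes in gene_sets.values()]
    if not sets:
        return set()
    pan = set().union(*sets)          # every gene seen in any isolate
    core = set.intersection(*sets)    # genes shared by all isolates
    return pan - core


def jaccard_distance(genes_a, genes_b):
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means identical gene content."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


# Made-up gene content for two hypothetical isolates.
isolates = {
    "antrum_1": {"geneA", "geneB", "geneC"},
    "corpus_1": {"geneA", "geneB", "geneD"},
}
print(sorted(accessory_genes(isolates)))
print(jaccard_distance(isolates["antrum_1"], isolates["corpus_1"]))
```

Under this view, isolates from a single-strain infection would be expected to show distances close to zero, whereas a mixed strain infection would produce groups of isolates separated by larger distances.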

Population deep sequencing with read mapping is much more sensitive for detecting polymorphisms than conventional single colony sequencing, but combining both methods is best
Single colony isolates were used to validate the deep sequencing minor allele calling pipeline. Using a read mapping approach and exact base matching to compare methodologies (minor allele calling of deep sequenced populations and SNP calling between single colony isolates) we determined an average of 68.69% (95% CL: 59.13-78.25%) alleles were detected only by the population deep sequencing, 8.13% (95% CL: 4.83-11.43%) by the single colony isolate analysis and 23.18% (95% CL: 14.93-31.43%) were concordant between methodologies (Suppl. Fig. 22).
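The comparison between the two methodologies amounts to partitioning the union of variant calls into three groups. The sketch below shows this for a single sample, assuming that each call is represented as a hypothetical `(contig, position, alternative base)` tuple so that set membership corresponds to exact base matching; the function name and the example calls are illustrative only. The percentages reported above are averages across samples, which this single-sample sketch does not reproduce.

```python
def compare_call_sets(population_calls, colony_calls):
    """Percentage of calls unique to each method and shared by both."""
    population_calls = set(population_calls)
    colony_calls = set(colony_calls)
    total = len(population_calls | colony_calls)
    if total == 0:
        raise ValueError("no variant calls to compare")
    return {
        "population_only": 100 * len(population_calls - colony_calls) / total,
        "colony_only": 100 * len(colony_calls - population_calls) / total,
        "concordant": 100 * len(population_calls & colony_calls) / total,
    }


# Made-up calls for illustration: (contig, 1-based position, alternative base).
population = {("contig1", 1200, "G"), ("contig1", 5400, "T"), ("contig2", 310, "A")}
colonies = {("contig2", 310, "A"), ("contig3", 77, "C")}
print(compare_call_sets(population, colonies))
```

The three percentages always sum to 100, so the share of alleles detectable by deep sequencing is the sum of the population-only and concordant groups.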

Discussion
By employing a combination of deep and single colony sequencing approaches in H. pylori for the first time, we detected extensive diversity within and between bacterial populations from the antrum and corpus regions of patient stomachs, including in virulence and colonization-associated genes. This combined approach generated a richer dataset than either individual approach.
Of the studies that have investigated genetic diversity of H. pylori within individual patients, 17 In this study, we used deep sequencing to detect much higher levels of genetic diversity in H. pylori populations than has previously been reported, even when very stringent parameters were applied (Table 1; Figure 2). When the parameters were relaxed to detect minor allelic variants (Table 1; Figure 2), a much larger number of polymorphic sites were detected. This study presents a comprehensive snapshot of H. pylori genetic diversity at a single point in time.
Some of the most frequently identified common allelic variant genes were related to virulence and colonization, e.g. OMPs, vacA paralogue HP0289, babA (HP1243), and cagA (HP0547). In comparison, many of the minor allelic variants were in OMP genes, which make up approximately 4% of the H. pylori coding genome 37 and are highly diverse and polymorphic. [37][38][39] Most studies to date have studied polymorphic variation in H. pylori OMPs between patients, geographical regions, sequential isolates from animal models or familial isolated strains to reach these conclusions. 27,38,40,41 The comparative genetics approaches of such studies have contributed to the identification and understanding of OMP polymorphic diversity, but polymorphic diversity has rarely been identified 36 within populations taken from the same time point, despite the identification of OMP phase variation and gene conversion. [40][41][42] This is perhaps hampered by the difficulty and increased workload in isolating single colonies from population sweeps, the lack of paired biopsy samples from the same patient and the increased sequencing costs such investigations incur.

Figure 3 (caption, continued). ... products with higher observed genetic diversity. Hypothetical proteins and intergenic nucleotide diversity were removed from this dataset to improve visualization. Only patients with paired antrum and corpus data were included. Patient 565 was excluded due to the extreme variation observed, likely due to this sample harboring a mixed strain infection. An unfiltered figure including patient 565 can be found in Suppl.
cagA gene diversity between individuals and geographical regions 8,43 is well characterized. In this study, we evidence both within- and between-stomach region genetic diversity of cagA (HP0547; Figure 2; Figure 3; Suppl. Fig. 19-21), which supports the observations of other studies. 17,36 Within- and between-region diversity in cagA sequences could result in variations in virulence activity within the stomach, thus influencing gastritis patterns and disease development. Within-region polymorphic diversity of vacA paralogues HP0289, HP0610 and HP0922 was also observed. These genes are thought to play roles in host colonization and collagen degradation. Diversity here could therefore impact on colonization, persistence, and disease outcomes.
Several important genes and groups of genes were identified within both the common and minor allele pipelines (Figure 2; Suppl. Fig. 19-20). These included the DNA-directed RNA polymerase beta subunit (HP1198/rpoBC). Certain mutations within the rpoB gene of H. pylori have been shown to increase resistance to rifamycins. 44,45 We also observed diversity in the clarithromycin resistance-associated gene kefB (HP0471). 46,47 These observations might help to explain eradication therapy failure within patients whereby minority-resistant strains persist within the population. Methyl-accepting chemotaxis associated genes (HP0082; HP0103; HP0099; HP0599) were also identified (including population consensus genome alignment-based SNPs in patient 45). These genes are important for H. pylori colonization and survival, as tlpA (HP0099) senses arginine, bicarbonate, and acidic pH, 48,49 and tlpB (HP0103) is essential for chemotaxis away from acidic pH and toward more favorable conditions. 50 Such allelic diversity could affect the sensitivity of chemotactic responses and enable colonization of different areas of the stomach by these strains.
In addition, restriction-modification system genes were identified as highly allelic. Furuta et al (2015) 51 observed similar genetic diversity among restriction-modification associated genes using a comparative genetics approach with single colony isolates obtained from five families. Other studies have also observed restriction-modification system diversity between strains taken from different patients. 11,52,53 Bacterial restriction-modification systems confer protection against invading foreign DNA 54 but do not pose a barrier to homologous recombination. 55 Restriction-modification systems can also influence gene expression 56,57 and may have roles in adhesion and virulence. 58,59 Diversity in restriction-modification genes could therefore play a role in niche adaptation and persistence.
Lipopolysaccharide biosynthesis associated genes 60 were also highly polymorphic in the common and minor allele pipelines (Figure 2; Suppl. Fig. 19-20). 18 samples had within-region minor allelic diversity amongst genes of the putative outer membrane biogenesis complex components. 60 While we have focused on differences and diversity, our dataset could equally be used to identify useful regions of conservation that might be exploited in vaccine development. For example, genetic interruption of the HP0270 gene in H. pylori, a homolog of the LpxM protein involved in lipid A biosynthesis, has been shown to be lethal. 61 Our data suggest that HP0270 is conserved as no allelic diversity was observed within any samples in our minor allele calling pipeline. Other such examples are likely to be extracted from this dataset which might be useful for vaccine design.
Estimation of the length of time H. pylori populations had been colonizing at the time of sampling (Suppl. Table 3) was likely to be an underestimate due to the stringency of the minor allele calling pipeline (Table 1). This was backed up by the single colony isolate SNP analysis (Suppl. Fig. 22) which identified 8.1% of SNPs not called by the minor allele pipeline. Furthermore, selection pressures acting on the populations over time are likely to add to this underestimation. Our dataset suggested that the average population showed a signature of mutations consistent with a diversification over 3.1 years (range 0.07-21.7 years; Suppl. Table 3). This is consistent with many other studies. 17,26,27 However, some populations, such as the antrum and corpus locations of patients 565 and 732, estimated a length of infection as long as 289 years. It is speculated that these patients were likely infected with a mixed strain infection which has diversified over time. In these cases, the snapshot of the length of diversification is likely to be an overestimation due to homologous recombination acting between these different strains. 27 These observations might have been overlooked if only a single colony sequencing approach was used as deep sequencing was able to snapshot an abundance of allelic diversity within the populations.
Our combined approach revealed that some genes identified with between-region diversity by genome alignment also had polymorphic diversity within one or both regions of the stomach (Figure 3; Suppl. Fig. 21). For example, the iron regulated OMP gene (HP0876) of sample 265C was polymorphic within this population but was also identified by whole-genome alignment as variable between the consensus genome sequences of the antrum and corpus-derived samples from patient 265. This observation was not uncommon across the dataset.
The phylogenetic relationships between single colony isolates obtained from the antrum and corpus in this study ( Figure 4) agree with Ailloud et al (2019). 17 There were distinct clades of H. pylori between the antrum and corpus in some patients, and there was evidence of migration between the two sites in others. Ailloud et al (2019) 17 showed that migration of H. pylori strains from the antrum to the corpus was relatively infrequent, whereas migration between the corpus and fundus is a more common event, perhaps due to the more significant environmental differences between the antrum and corpus regions. Our phylogenetic analysis also indicated that where an antrum cluster was present, corpus strains were more frequently observed within antrum clades than vice versa. This suggests that while migrations between the antrum and corpus are infrequent, the corpus isolates are more likely to migrate to the antrum than vice versa. Again, this could be due to the differences between the antrum and corpus environments where antrum isolates are less fit or poorly adapted to colonize the harsher oxyntic epithelium whereas the corpus strains are able to colonize the more neutral antrum glands.
An alternative explanation for the non-clustering of antrum and corpus strains in patients 295 and 444 could be due to the biopsy sampling location and method. Fung et al (2019), 62 showed how founder strains initially colonize glands then spread to adjacent glands in the immediate vicinity. This creates islands of closely related H. pylori strains, and where island boundaries occur the inhabitants may then compete for space. At these boundaries there are glands containing a mixture of strains, or adjacent glands containing different strains side by side. Transition zones between the antrum and corpus typically contain mixed populations. The size of a biopsy (how many strain islands it spans) and its proximity to the antrum-corpus transition zone may influence the H. pylori diversity observed. This could potentially disrupt the genetic clustering and phylogenetic topology of the antrum and corpus isolates. Therefore, standardized sampling locations within the antrum and corpus would be beneficial for this type of analysis. However, this may not be feasible in practice because biopsies are often taken from areas likely to harbor H. pylori infection, such as adjacent to visually diseased epithelium.
The pan-genome analyses of the single colony isolates and the consensus assembled deep-sequenced populations both showed that the strains from each patient had a unique pattern of gene presence and absence, distinct from the patterns observed in strains from other patients (Suppl. Fig. 18). The accessory genes that made up each patient's "fingerprint" appeared to be largely core genes among strains taken from the same stomach, regardless of the stomach location. However, while the H. pylori strains from patient 565 all clustered together, there were substantial gene content differences between single colony isolates from this patient. The corpus isolates from this patient were much more diverse than the antrum isolates, not just allelically but also in their accessory genes. Consequently, patient 565 may have been infected by more than one H. pylori strain. These strains may have exchanged DNA through homologous recombination and/or natural transformation over time, so that they now share a similar genetic makeup, but the population remains much more genetically diverse than that of a single strain infection.
By combining conventional sequencing of single colony isolates from the antrum and corpus with population deep sequencing of the same samples, from multiple patients, we were able to comprehensively characterize H. pylori population diversity. The dual approach allowed for comparative analysis to determine how well the data generated from each approach agreed with each other. This confirmed that the deep sequencing allelic calling pipeline was able to detect 91.87% of the SNPs, with only 8.13% of SNPs identified from the single colony pipeline alone (Suppl. Fig. 22). The population deep sequencing minor allele calling methodology captured a more comprehensive snapshot of population genetic diversity compared to the single colony approach. However, the minor allele calling pipeline is still likely to be an underestimate of the true diversity present, due to the stringent quality control parameters that were applied, and single colony isolate sequencing from the same populations added value because it enabled additional phylogenetic and recombination analyses that would not have been possible using the deep sequencing data alone.
Combining the two complementary approaches of deep population sequencing and single colony isolate sequencing generated more information than either individual strategy. However, this combined approach was intensive and was applied to only a relatively small number of UK-based patients in this study. The sample size was not sufficient to identify any significant associations between genomic traits in the H. pylori populations and the presence of ulcers, intestinal metaplasia, or severe inflammation in the patients' stomachs. The study also relied on a culture-based approach, so it may not provide a full picture of the genetic diversity of strains present in the stomach, owing to the selection pressures applied by culture on agar plates prior to sequencing. Finally, although genetic changes over time can be inferred from the data, this study was not longitudinal.
In conclusion, we have shown that single colony analysis alone can identify fixed differences in the genomes of H. pylori between stomach regions, but population deep sequencing reveals that underlying variation is still present and the population as a whole retains a high degree of potential plasticity. This may help explain why H. pylori can persist in a chronic infection. Identifying loci with minimal variation might usefully inform future vaccine design, while loci in which high numbers of minor allelic variants are concentrated indicate which genes are critical for niche adaptation and persistence.

Materials and methods
A summary of the methods used in this study is presented in Figure 5.

Clinical samples
Gastric biopsies were donated by 18 H. pylori-infected patients (median age 63.5 years, range 40-79 years; 38.9% male) attending the Queen's Medical Center, Nottingham, UK, for routine upper GI tract endoscopy to investigate dyspeptic symptoms. Written informed consent was obtained, and the study was approved by the Nottingham Research Ethics Committee 2 (08/H0408/195). Patients were excluded if they had taken antibiotics, proton pump inhibitors or >150 mg/day aspirin in the 2 weeks preceding the endoscopy. Sections cut from fixed, paraffin-embedded biopsies were stained with hematoxylin and eosin and assessed by an experienced histopathologist who was blinded to all other data. Modified Sydney scores for inflammation, activity, atrophy, and intestinal metaplasia were provided. 63 Isolates underwent PCR genotyping for the virulence genes vacA, cagA and cagE as described previously. 64,65 Patients with variable Sydney scores and/or H. pylori virulence genotypes between their antrum and corpus biopsies were prioritized for inclusion in this study. Biopsies taken from the antrum and corpus of 15 patients, along with single biopsies from one region of 3 patients, were swept across blood base #2 agar plates containing 5% (v/v) horse blood and incubated for 2-3 days at 37°C under microaerophilic conditions (10% CO2, 5% O2, 85% N2). H. pylori growth from each biopsy sweep was picked and pooled into isosensitest broth containing 15% (v/v) glycerol for long-term storage at −80°C. For single colony isolation, pooled cultures were streaked out to obtain multiple single colony isolates. These were picked at random and passaged one to two times to increase bacterial numbers for long-term storage.

DNA extraction and whole-genome sequencing
Genomic DNA from population sweeps and single colony isolates was extracted using the QIAGEN QIAamp DNA Mini Kit following the manufacturer's instructions. DNA quality was determined on a NanoDrop 2000 spectrophotometer, applying strict absorbance ratio cutoff values of A260/A280 1.8-1.9 and A260/A230 1.9-2.2. Genomic DNA was quantified using the dsDNA high sensitivity assay on a Qubit v4.0 fluorometer and diluted to 0.3 ng/μl for creation of Nextera XT paired-end libraries. Sequencing used Illumina V3 chemistry cartridges spiked with 1% PhiX DNA and run at 2 × 250 cycles on the MiSeq platform.
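The purity cutoffs and the dilution to 0.3 ng/μl described above can be expressed as a short calculation. The following Python sketch is illustrative only and is not part of the study's workflow; the function names, the example stock concentration, and the final volume are assumptions.

```python
from typing import Tuple


def passes_purity_qc(a260_a280: float, a260_a230: float) -> bool:
    """Apply the absorbance ratio cutoffs stated in the text."""
    return 1.8 <= a260_a280 <= 1.9 and 1.9 <= a260_a230 <= 2.2


def dilution_volumes(stock_ng_per_ul: float,
                     final_volume_ul: float,
                     target_ng_per_ul: float = 0.3) -> Tuple[float, float]:
    """Return (stock volume, diluent volume) in microlitres, using C1*V1 = C2*V2."""
    if stock_ng_per_ul < target_ng_per_ul:
        raise ValueError("stock is already below the target concentration")
    stock_ul = target_ng_per_ul * final_volume_ul / stock_ng_per_ul
    return stock_ul, final_volume_ul - stock_ul


if __name__ == "__main__":
    # Hypothetical sample: 30 ng/ul stock diluted into a 100 ul final volume.
    print(passes_purity_qc(1.85, 2.0))
    print(dilution_volumes(30.0, 100.0))
```

A sample failing either ratio would be re-extracted or excluded before library preparation; that decision is outside the scope of this sketch.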
The raw sequencing data generated in this study were deposited on the NCBI website with the accession code PRJNA787419.

Figure 5. Overview of study design. H. pylori isolates from corpus and antrum biopsies were subjected to population deep sequencing (right; b1-6) and streaked out to single colonies which were individually sequenced (left; a1-5). Data from the two approaches were used to analyze H. pylori diversity within and between stomach regions and individuals. a1 - isolation of single colony isolates; a2 - single colony isolates sequenced to ~30x depth; a3 - single colony isolate genomes aligned to investigate between-strain diversity, with orange nucleotide bases representing SNPs; a4 - single colony isolate genomes used to create phylogenetic trees; a5 - gene presence and absence analysis. b1 - culture of H. pylori populations from each biopsy; b2 - population deep sequencing to >100x depth; b3 - within-population diversity investigated by mapping reads to the consensus genome (green colored nucleotides) with minor (blue boxes) and common (orange box) allelic diversity highlighted; b4 - between-stomach-region diversity analyzed by aligning the consensus genome assemblies from each region's population; b5 - BLASTN identity between stomach region populations investigated using the consensus assembled population genomes, highlighting areas of high diversity using BRIG 19; b6 - pan-genome analysis.

Sequencing read curation, contamination detection, whole-genome assembly, and annotation
Raw reads were trimmed using Trimmomatic 20 (version 0.38). Reads were further trimmed with Sickle 21 (version 1.33) to Phred 30 with a minimum read length of 50 bp. Trimmed reads were inspected with FastQC 66 0.11.7 to confirm the expected trimming. Curated reads were passed through Kraken 67 1.0 with the MiniKraken database (RefSeq complete bacterial, archaeal and viral genomes captured on 18/10/2017) to detect contaminated samples, defined as <92% of curated sequencing reads mapped to H. pylori and/or <95% of the reads mapped to the Helicobacter genus. Two deep-sequenced samples (308A and 326A) contained contaminant reads from an unknown species.
Curated reads were used to construct de novo assemblies using the SPAdes 68 3.11.1 assembler in careful mode, creating consensus whole-genome assemblies for the deep-sequenced clinical sweep samples; contigs <500 bp were removed. For analysis of strains from multiple regions of the same stomach (Fig. 1C; Suppl. Figure 5C, 7C, 9C, 10C, 11C, 12C, 13C, 14C, 15, 16), a patient-specific combined antrum and corpus consensus assembly was de novo assembled using the same settings by combining the curated reads from both antrum and corpus regions. This was done to reduce the reference bias associated with using the consensus assembly from either the deep-sequenced antrum population or the deep-sequenced corpus population alone. For strains isolated from patients where only one region was population deep sequenced, the population consensus assembled genome sequence from that region was used as the reference genome. Sequencing depth/coverage was determined using mosdepth 0.2.3. 69 Contaminant reads were assembled into separate contigs and excluded from all further analyses, with the exception of the BLAST ring image generator (BRIG; version 0.95) analysis 19 in Suppl. Fig. 8-9.
Consensus population sweeps and single colony assemblies were annotated using Prokka 70 (version 1.13) guided by the reference H. pylori strain 26695 (NC_000915.1) with an e-value threshold of 0.001 to account for gene diversity. A list of the gene product descriptions for each gene ID (HP numbers) can be found in Suppl. Table 1.

Within-region diversity of deep-sequenced bacterial sweeps
A read mapping and polymorphism detection pipeline was developed, similar to that of Lieberman et al (2014) 18, using identical thresholds for minor allele calling. Adapted thresholds, including elevated allele fractions, were used to identify alleles with very high confidence; these were defined as 'common alleles' within datasets. An overview of the pipeline developed for this study is given in Table 1.
Validation of the minor allele calling pipeline was achieved by mapping the curated single colony sequencing reads of the antrum and the corpus to the corresponding clinical sweep deep-sequenced consensus assembled reference genomes from each region (19 regions total) using Snippy 71 4.4.0 with an alternative allele support >90%, coverage of ≥6, minimum mapping quality of Phred 34 and minimum base quality of Phred 30. Data from 565C were excluded due to the potential presence of different H. pylori strains: isolates 565C1 and 565C6 had an unusually high number of contigs (565C1: 1,501; 565C6: 1,295), low N50 statistics (565C1: 3,665; 565C6: 3,977) and large genome lengths (565C1: ~2.62 Mbp; 565C6: ~2.58 Mbp).
Within-region diversity was further investigated using the single colony isolates from each patient, analyzed using BRIG 19 (Fig. 1C; Suppl. Figure 5C, 7C, 9C, 10C, 11C, 12C, 13C, 14C, 15, 16). Briefly, the single colony isolate genomes from the antrum and/or corpus of each patient were aligned 72 to the patient reference consensus genome, which was assembled as previously described using all deep population sequencing reads from the antrum and corpus datasets (or the population consensus reference genome if only one location was sequenced), and run through BRIG with a lower identity threshold of 96%. Where there were gaps in the query genomes, the genes were annotated and overlaid to highlight the missing genes.
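The assembly summary statistics quoted earlier in this section for isolates 565C1 and 565C6 (contig count, N50, and total length) can be derived directly from a list of contig lengths. The following Python sketch is a minimal illustration, not the tooling used in the study; the function name and the toy contig lengths are assumptions, and the 500 bp default mirrors the contig length filter described for the assemblies.

```python
from typing import Dict, Iterable


def assembly_stats(contig_lengths: Iterable[int],
                   min_contig_bp: int = 500) -> Dict[str, int]:
    """Summarize an assembly after discarding contigs shorter than min_contig_bp."""
    kept = sorted((n for n in contig_lengths if n >= min_contig_bp), reverse=True)
    total = sum(kept)
    running = 0
    n50 = 0
    # N50: length of the contig at which the cumulative length, taken from the
    # longest contig downwards, first reaches half of the total assembly length.
    for length in kept:
        running += length
        if 2 * running >= total:
            n50 = length
            break
    return {"contigs": len(kept), "total_bp": total, "n50": n50}


if __name__ == "__main__":
    # Toy contig lengths (hypothetical); the 100 bp contig is filtered out.
    print(assembly_stats([2000, 2000, 2000, 3000, 3000, 4000, 8000, 8000, 100]))
```

A highly fragmented assembly with an inflated total length, as reported for 565C1 and 565C6, would show up here as a large contig count combined with a low N50.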
Length of H. pylori colonization time (years) was estimated for each patient and stomach location using the deep population sequencing data and the mutation rate determined by Kennemann et al (2011) 26, and is reported in Suppl. Table 3. Briefly, the number of observed minor allelic variant positions for each dataset was divided by the mutation rate expressed per genome per year (41.7 mutations per year in a genome of ~1.63254 Mb). The following equation was used, where X is the total number of minor allelic sites identified by the minor allele calling pipeline for each population:
\[ \text{colonization time (years)} = \frac{X}{41.7} \]
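As a worked illustration of the estimate described above, the following Python sketch divides a count of minor allelic sites by 41.7 mutations per genome per year. It follows the prose description only; the function and constant names are assumptions, and the example count is hypothetical rather than a value from the study.

```python
MUTATIONS_PER_GENOME_PER_YEAR = 41.7  # rate quoted in the text
GENOME_SIZE_BP = 1_632_540            # ~1.63254 Mb, as quoted in the text


def per_site_rate() -> float:
    """Mutation rate per site per year implied by the two figures above."""
    return MUTATIONS_PER_GENOME_PER_YEAR / GENOME_SIZE_BP


def estimate_colonization_years(minor_allelic_sites: int) -> float:
    """Estimated colonization time: minor allelic sites / mutations per year."""
    if minor_allelic_sites < 0:
        raise ValueError("site count cannot be negative")
    return minor_allelic_sites / MUTATIONS_PER_GENOME_PER_YEAR


if __name__ == "__main__":
    # Hypothetical population with 417 minor allelic sites.
    print(estimate_colonization_years(417))
    print(per_site_rate())
```

Because the minor allele calling pipeline is described as conservative, an estimate of this kind would be expected to err on the low side.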

Between-region diversity
BRIG was also used to investigate diversity between stomach regions. Paired antrum and corpus consensus assembled genomes from the deep population sequencing were used, and BLASTN identity was set to 95% to account for more potential diversity in this dataset. The antrum and corpus consensus genomes for each patient were each used in turn as the reference sequence, with the alternative region used as the query sequence, in order to fully investigate between-region genomic diversity. Where there were gaps in the query genomes, the genes were annotated and overlaid to highlight the missing genes. Mauve 72 was used to align paired antrum and corpus region population consensus genomes for each patient, and the alignment SNPs between the aligned genomes were exported.
Between-region diversity was further investigated using a consensus and read mapping alignment approach (Suppl. Fig. 23), which was used to identify high-confidence alignment SNPs by comparison with the alignment SNPs identified by the Mauve alignment. These curated between-region population consensus genome alignment SNPs were compared against the minor allele calling pipeline results displayed in Figure 3.
High-confidence alignment SNPs identified by consensus genome alignment and read mapping (Figure 3; Suppl. Fig. 23) were passed through a custom build of SnpEff 73 4.3 to determine whether the alignment-identified SNPs between paired antrum and corpus region population consensus genomes were synonymous or nonsynonymous (Suppl. Fig. 17).
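The distinction between synonymous and nonsynonymous SNPs can be illustrated with a short Python sketch. This is a simplified stand-in and not SnpEff's implementation or annotation model: it considers a single codon at a time and uses the standard amino acid assignments. The function name and the example codons are assumptions.

```python
from itertools import product

# Standard codon-to-amino-acid assignments, ordered by T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(ref_codon: str, pos_in_codon: int, alt_base: str) -> str:
    """Return 'synonymous' or 'nonsynonymous' for a single-base change in a codon."""
    ref_codon = ref_codon.upper()
    alt_base = alt_base.upper()
    if len(ref_codon) != 3 or pos_in_codon not in (0, 1, 2) or alt_base not in BASES:
        raise ValueError("expected a 3-base codon, a position 0-2 and a base in TCAG")
    alt_codon = ref_codon[:pos_in_codon] + alt_base + ref_codon[pos_in_codon + 1:]
    same = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same else "nonsynonymous"


if __name__ == "__main__":
    print(classify_snp("GAA", 2, "G"))  # GAA -> GAG, both glutamate
    print(classify_snp("GAA", 0, "A"))  # GAA -> AAA, glutamate to lysine
```

A real annotation tool also needs gene coordinates, strand, and reading frame to locate each SNP within its codon, which this sketch takes as given.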
Core genome phylogenetic trees were constructed for patients with paired antrum and corpus single colony isolates using Snippy 71 4.4.0, and sites of recombination were detected and removed with Gubbins 2.3.1. Polymorphic sites between the aligned genomes were extracted to create a SNP alignment file using the SNP-sites tool 74 2.4.1. The SNP alignment-based phylogeny was constructed using FastTree 75 2.1.1, which infers approximately maximum-likelihood trees.
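The step of extracting polymorphic sites from a genome alignment can be sketched in a few lines of Python. This illustrates the general idea only and is not the SNP-sites implementation; the function name, the treatment of N and gap characters as uninformative, and the toy alignment are assumptions.

```python
from typing import Dict, List, Tuple


def extract_snp_sites(alignment: Dict[str, str]) -> Tuple[List[int], Dict[str, str]]:
    """Return the 0-based variable columns and the alignment reduced to those columns.

    A column counts as variable when at least two different unambiguous bases
    (A, C, G, T) are observed in it; N and gap characters are ignored.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("need a non-empty alignment of equal-length sequences")
    n_columns = lengths.pop()

    positions = []
    for i in range(n_columns):
        observed = {seq[i].upper() for seq in alignment.values()} & set("ACGT")
        if len(observed) > 1:
            positions.append(i)

    reduced = {
        name: "".join(seq[i] for i in positions)
        for name, seq in alignment.items()
    }
    return positions, reduced


if __name__ == "__main__":
    # Toy alignment with hypothetical isolate names.
    toy = {"antrum_1": "ACGTA", "corpus_1": "ACGAA", "corpus_2": "ANGTA"}
    print(extract_snp_sites(toy))
```

Reducing the alignment to variable columns keeps the input to tree inference small while retaining the sites that distinguish isolates.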
Pan-genome analysis of all single colony isolates was undertaken by Panaroo 76 version 1.2.2 using default settings (Suppl. Fig. 18).

Heat maps and Venn diagrams
Heat maps were created using ggplot2 and RColorBrewer through R statistical software 77 3.5.1. "HP" gene reference numbers (Suppl. Table 1) from the reference strain 26695 (NC_000915.1) were used for nomenclature followed by the gene product information as provided by the genome annotation pipeline described above. Where there were no matches to "HP" numbers, the gene abbreviation (if applicable) was used followed by the gene product information.
Venn diagrams were created using the R statistical software VennDiagram package 1.6.20. 78 For each patient and stomach location, deep-sequenced within-population allelic sites were compared with the SNPs identified between single colony isolates and the consensus assembled genome. Sites that matched exactly were counted as overlapping, and the remaining SNP and allelic sites were counted as unique to one method. These data were used as an internal control to identify the overlap between the single colony and deep sequencing minor allele calling methodologies. We show good concordance between the two methodologies, suggesting that the minor allele calling methodology has the power to identify true positive variant sites within the populations.
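The overlap counts behind such a Venn diagram reduce to set operations on variant sites. The following Python sketch is illustrative only; the representation of a site as a (contig, position, allele) tuple, the function name, and the toy sites are assumptions rather than details taken from the study's pipeline.

```python
from typing import Dict, Set, Tuple

Site = Tuple[str, int, str]  # (contig, 1-based position, alternative allele)


def overlap_summary(single_colony_snps: Set[Site],
                    deep_seq_sites: Set[Site]) -> Dict[str, float]:
    """Count shared and method-specific sites, and the share of single colony
    SNPs that were also recovered by deep sequencing."""
    shared = single_colony_snps & deep_seq_sites
    pct = (100.0 * len(shared) / len(single_colony_snps)
           if single_colony_snps else float("nan"))
    return {
        "shared": len(shared),
        "single_colony_only": len(single_colony_snps - deep_seq_sites),
        "deep_seq_only": len(deep_seq_sites - single_colony_snps),
        "pct_single_colony_recovered": pct,
    }


if __name__ == "__main__":
    # Toy sites (hypothetical contig names and positions).
    colonies = {("c1", 10, "A"), ("c1", 55, "T"), ("c2", 7, "G"), ("c2", 90, "C")}
    deep = {("c1", 10, "A"), ("c1", 55, "T"), ("c2", 7, "G"),
            ("c3", 5, "A"), ("c3", 8, "T")}
    print(overlap_summary(colonies, deep))
```

Requiring the allele as well as the position to match is the stricter of the possible definitions of an exact match; matching on position alone would only need the tuple shortened.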